Introduction to R and RStudio

Raphael Morsomme

Welcome

Meet the instructor

Headshot of Raphael Morsomme

  • Ph.D. candidate in statistics at Duke University
  • Research focus: stochastic epidemic models and breast cancer overdiagnosis
  • Fan of Formula 1 and model building
  • UCM alum of 2018

Objectives of the workshop

  • to introduce you to R and RStudio
  • to give you the tools to conduct the data analysis portion of your research project in R

Expectations

  • What background is assumed for the workshop? Everyone is welcome.
  • Will we be doing coding? Yes. We will use R.
  • What if I have never coded? I do not expect participants to have any experience with R or any other programming language.
  • Will we learn the mathematical theory of statistics? No. We will only learn to visualize and implement simple statistical models in R.

Outline

  • data analysis
  • R and RStudio
  • manipulating numbers, vectors and dataframes
  • visualization
  • modeling
  • visualizing models

Data analysis

What is data analysis?

Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. (…) In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.”

Source: Wikipedia

Examples of data analyses in practice

R and RStudio

Why R

  • free and open-source,
  • large community of users (stackoverflow)
  • used in almost every industry
  • go-to programming language for statisticians, and
  • great tools for visualization

Getting started

rstudio

Getting started with RStudio

Manipulating numbers

Using R as a calculator

Type the following in the Console (lower left corner in RStudio)

5 + 7
[1] 12

and press (ctrl/command + enter) to run the line of code.

Instead of typing, you can also simply copy and paste the code in the console.

Comments

What follows a “#” is a comment

2 ^ 10 # exponent
[1] 1024

The part 2 ^ 10 is run by R and the part # exponent is ignored by R.

Comments are useful for communicating with collaborators.

Calculating more complex expressions

(3 + 8) / (2 + 8 ^ 1.5)
[1] 0.4466567
((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7
[1] 8445.545

The sky is the limit.

Assigning values

Suppose you want to use the number

((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7
[1] 8445.545

for different purposes, e.g. multiply it, divide it, raise it to different powers, etc.

We can simply store this number by assign it to some object with the command =.

x = ((3 + 8) / (2 + 8 ^ 1.5))^(-5.65 / 4)^7

The object x is the number I wanted to save.

I can print the value of x

x
[1] 8445.545

I can multiply it, divide it, raise it to different powers

5*x
[1] 42227.72
x/2.5
[1] 3378.218
x^3
[1] 602397282682

Reconstructing x

Let us use what we just learned to reconstruct x from the numerator, denominator and exponent.

numer = 3 + 8
denom = 2 + 8 ^ 1.5
expon = (-5.65 / 4)^7

x_reconstructed = (numer / denom) ^ expon

The environment

Check the environment in the upper right corner. It contains the objects that you have created.

you should see the objects denom, expon, numer, x and x_reconstructed.

Note that x and x_reconstructed have the same value!

Simple operations

Here is a list of R commands for common mathematical operations

sin(2) # sine
[1] 0.9092974
cos(2) # cosine
[1] -0.4161468
log(10) # natural logarithm
[1] 2.302585
exp(4) # Euler's constant e (2.718) raised to the fourth power
[1] 54.59815
exp(log(10)) # nested functions
[1] 10

On the previous slide, note that the last line of code is equivalent to the following two lines

tmp = log(10)
exp(tmp)
[1] 10

Manipulating vectors

Vectors

A vector of length \(k\) is simply a sequence of \(k\) numbers.

In R, we create vectors with the c command

c(2, 5, 5, 8, 0)
[1] 2 5 5 8 0

Assigning a vector

I can of course assign a vector to an object

x = c(2, 5, 5, 8, 0) # replace previous object x
x
[1] 2 5 5 8 0

Overwriting objects

Note that this piece of code replaced the previous object x present in the environment. The previous value of x is now lost.

Operations on vectors

Here is a list of R commands for common operations on vectors

1.5 * x + 2
[1]  5.0  9.5  9.5 14.0  2.0
mean(x)
[1] 4
median(x)
[1] 5
max(x)
[1] 8
min(x)
[1] 0
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       2       5       4       5       8 
var(x)
[1] 9.5
sd(x)
[1] 3.082207

Manipulating dataframes

Dataframes

A dataframe is simply a collection of vectors of same length.

  • Vectors: columns

  • Dataframes: rectangles

Libraries

Packages are user-created collections of new functions for R.

Packages need to be

  • installed on your machine once
  • loaded every session.

The package ggplot2 contains the dataframe mpg.

# install.packages("ggplot2") # install ggplot2; run only once
library(ggplot2) # load ggplot2; run every session

The mpg dataframe

For simplicity, I assign the dataframe mpg to the object d

d = mpg

I print its first few rows for inspection with the command head.

head(d)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

Dimensions of mpg

The commands nrow and ncol respectively provide the number of rows (observations) and columns (variables) of a dataframe.

nrow(d)
[1] 234
ncol(d)
[1] 11

Accessing variables

I use the command $ to access a variable

d$displ
  [1] 1.8 1.8 2.0 2.0 2.8 2.8 3.1 1.8 1.8 2.0 2.0 2.8 2.8 3.1 3.1 2.8 3.1 4.2
 [19] 5.3 5.3 5.3 5.7 6.0 5.7 5.7 6.2 6.2 7.0 5.3 5.3 5.7 6.5 2.4 2.4 3.1 3.5
 [37] 3.6 2.4 3.0 3.3 3.3 3.3 3.3 3.3 3.8 3.8 3.8 4.0 3.7 3.7 3.9 3.9 4.7 4.7
 [55] 4.7 5.2 5.2 3.9 4.7 4.7 4.7 5.2 5.7 5.9 4.7 4.7 4.7 4.7 4.7 4.7 5.2 5.2
 [73] 5.7 5.9 4.6 5.4 5.4 4.0 4.0 4.0 4.0 4.6 5.0 4.2 4.2 4.6 4.6 4.6 5.4 5.4
 [91] 3.8 3.8 4.0 4.0 4.6 4.6 4.6 4.6 5.4 1.6 1.6 1.6 1.6 1.6 1.8 1.8 1.8 2.0
[109] 2.4 2.4 2.4 2.4 2.5 2.5 3.3 2.0 2.0 2.0 2.0 2.7 2.7 2.7 3.0 3.7 4.0 4.7
[127] 4.7 4.7 5.7 6.1 4.0 4.2 4.4 4.6 5.4 5.4 5.4 4.0 4.0 4.6 5.0 2.4 2.4 2.5
[145] 2.5 3.5 3.5 3.0 3.0 3.5 3.3 3.3 4.0 5.6 3.1 3.8 3.8 3.8 5.3 2.5 2.5 2.5
[163] 2.5 2.5 2.5 2.2 2.2 2.5 2.5 2.5 2.5 2.5 2.5 2.7 2.7 3.4 3.4 4.0 4.7 2.2
[181] 2.2 2.4 2.4 3.0 3.0 3.5 2.2 2.2 2.4 2.4 3.0 3.0 3.3 1.8 1.8 1.8 1.8 1.8
[199] 4.7 5.7 2.7 2.7 2.7 3.4 3.4 4.0 4.0 2.0 2.0 2.0 2.0 2.8 1.9 2.0 2.0 2.0
[217] 2.0 2.5 2.5 2.8 2.8 1.9 1.9 2.0 2.0 2.5 2.5 1.8 1.8 2.0 2.0 2.8 2.8 3.6

This is simply a vector. I can assign it to a object and manipulate it.

x = d$displ
mean(x)
[1] 3.471795
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.400   3.300   3.472   4.600   7.000 
var(x)
[1] 1.669158
var(d$displ) # equivalent to the previous line
[1] 1.669158

Break

Visualization

Visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

Types of variables

Roughly speaking, there are two types of variables

  • numerical (height, age, number of siblings, etc)
  • categorical (eye color, level of education, place of birth, etc)

Histograms

A histogram summarizes the distribution of a continuous variable. Simply use the command hist.

hist(x = d$cty)

Title, axis label and number of bins

We can improve the figure by adding a title, improving the x-axis label and changing the number of breaks

hist(
  x = d$cty,
  xlab = "City Miles per Gallon",
  main = "Fuel Consumption in City",
  breaks = 15
  )

Help files

The help file of a R function describes

  • what a function does,
  • what arguments are available (e.g. xlab, main, breaks)
  • a few examples (often)

To access the help file of a function, simple use the command help

help(hist)

Boxplots

Boxplots are a more compact way than histograms to summarizes the distribution of a continuous variable. To make a boxplot, simply use the command boxplot.

boxplot(x = d$hwy)

Figure editing

Again, we can add a title and a y-axis label to improve the figure.

boxplot(
  x = d$hwy, 
  ylab = "Miles per Gallon",
  main = "Fuel Consumption on Highway"
  )

Scatterplot

Scatterplot are used to visualize the relationship between two continuous variables

plot(x = d$cty, y = d$hwy)

Figure editing

plot(
  x = d$cty,
  y = d$hwy,
  main = "Fuel Consumption of 38 Car Models",
  xlab = "City Miles per Gallon",
  ylab = "Highway Miles per Gallon"
  )

Tables

To summarize a categorical variable, we can use a table

table(d$drv) # 4-wheel, front-wheel, rear wheel drive

  4   f   r 
103 106  25 

Table for numerical variable?

What happens if we make a table for a numerical variable?

table(d$hwy) # meaningful?

12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 41 
 5  2 10  7 31 10 13 11  2  7  7 13 15 32 14  7 22  4  7  4  2  1  2  2  1  1 
44 
 2 

This is not helpful! A histogram would be much better.

R is only a tool

R is only a tool. It will let you use it to do stupid things!

Two-way tables

To look at the relation between two categorical variables, we can use a two-way table

table(d$drv, d$fl) # drive and fuel type
   
     c  d  e  p  r
  4  0  2  6 20 75
  f  1  3  1 25 76
  r  0  0  1  7 17

Modeling (requires statistics)

t-test

Is the fuel consumption in the city and on the highway the same?

t.test(d$cty, d$hwy) # two-sample t-test

    Welch Two Sample t-test

data:  d$cty and d$hwy
t = -13.755, df = 421.79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.521683 -5.640710
sample estimates:
mean of x mean of y 
 16.85897  23.44017 

Scientific notation

e-16 is equivalent to the scientific notation \(10^{-16}\).

Paired t-test

Actually, we should use a paired t-test in this case.

t.test(d$cty, d$hwy, paired = TRUE) # paired t-test

    Paired t-test

data:  d$cty and d$hwy
t = -44.492, df = 233, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.872628 -6.289765
sample estimates:
mean of the differences 
              -6.581197 

Simple linear regression

m_simple = lm(hwy ~ cty, data = d)
summary(m_simple)

Call:
lm(formula = hwy ~ cty, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3408 -1.2790  0.0214  1.0338  4.0461 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.89204    0.46895   1.902   0.0584 .  
cty          1.33746    0.02697  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

Multiple linear regression

m_multiple = lm(hwy ~ cty + displ, data = d)
summary(m_multiple)

Call:
lm(formula = hwy ~ cty + displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3124 -1.2423  0.0053  1.0296  4.1243 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.15145    1.21271   0.949    0.343    
cty          1.32914    0.04490  29.602   <2e-16 ***
displ       -0.03432    0.14791  -0.232    0.817    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.756 on 231 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.913 
F-statistic:  1224 on 2 and 231 DF,  p-value: < 2.2e-16

Interactions

m_interaction = lm(hwy ~ cty + displ + cty:displ, data = d)
summary(m_interaction)

Call:
lm(formula = hwy ~ cty + displ + cty:displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2546 -1.0888  0.1424  0.8978  4.1637 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.32675    1.48875   3.578 0.000422 ***
cty          1.03665    0.07795  13.299  < 2e-16 ***
displ       -1.69320    0.39471  -4.290 2.64e-05 ***
cty:displ    0.12029    0.02670   4.505 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared:  0.9208,    Adjusted R-squared:  0.9198 
F-statistic: 891.2 on 3 and 230 DF,  p-value: < 2.2e-16

Or simply

m_interaction2 = lm(hwy ~ cty*displ, data = d)
summary(m_interaction2)

Call:
lm(formula = hwy ~ cty * displ, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2546 -1.0888  0.1424  0.8978  4.1637 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.32675    1.48875   3.578 0.000422 ***
cty          1.03665    0.07795  13.299  < 2e-16 ***
displ       -1.69320    0.39471  -4.290 2.64e-05 ***
cty:displ    0.12029    0.02670   4.505 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.687 on 230 degrees of freedom
Multiple R-squared:  0.9208,    Adjusted R-squared:  0.9198 
F-statistic: 891.2 on 3 and 230 DF,  p-value: < 2.2e-16

ANOVA

m_anova = aov(hwy ~ class, data = d)
summary(m_anova)
             Df Sum Sq Mean Sq F value Pr(>F)    
class         6   5683   947.2   83.39 <2e-16 ***
Residuals   227   2578    11.4                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise t-tests

pairwise.t.test(
  x = d$hwy, # response variable
  g = d$class, # grouping variable
  p.adjust.method   = "bonferroni" # Bonferroni correction to control type-I error rate.
  )

    Pairwise comparisons using t tests with pooled SD 

data:  d$hwy and d$class 

           2seater compact midsize minivan pickup  subcompact
compact    0.59562 -       -       -       -       -         
midsize    1.00000 1.00000 -       -       -       -         
minivan    1.00000 7.1e-06 0.00052 -       -       -         
pickup     3.9e-05 < 2e-16 < 2e-16 0.00011 -       -         
subcompact 0.82210 1.00000 1.00000 2.9e-05 < 2e-16 -         
suv        0.00064 < 2e-16 < 2e-16 0.00335 1.00000 < 2e-16   

P value adjustment method: bonferroni 

Visualizing models

Visualizing simple regression

plot(
  x = d$cty,
  y = d$hwy,
  main = "Fuel Economy of 38 Models of Cars",
  xlab = "City Miles per Gallon",
  ylab = "Highway Miles per Gallon"
  )
abline(m_simple, col = "red") # fitted regression curve

Visualizing ANOVA

boxplot( # complements m5 (ANOVA)
  d$hwy ~ d$class, 
  ylab = "Highway Miles per Gallon",
  xlab = "Type of Car"
  )

Going further

Two excellent books

Introduction to Modern Statistics, available on Openintro

R for Data Science, available on R4DS

Both books are freely available online.